We use a two-component hurdle model: the first component predicts whether a disease will be present (binary) and, if it is present, the second component predicts the case count (integer). Here we compare the results of a boosted tree model (xgboost) with those of our baseline model.
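As a rough illustration of how the two components of a hurdle model combine at prediction time, the sketch below gates a count prediction behind a presence prediction. It is a minimal sketch under stated assumptions, not the pipeline used in this report: the function names, the 0.5 threshold, the `cases_lag1` field, and the toy stand-in components are all hypothetical.

```python
from typing import Callable, Mapping

Features = Mapping[str, float]


def hurdle_predict(
    features: Features,
    presence_prob: Callable[[Features], float],
    count_given_presence: Callable[[Features], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the report
) -> int:
    """Return a predicted case count from a two-component hurdle model."""
    # Component 1: is the disease predicted to be present at all?
    if presence_prob(features) < threshold:
        return 0
    # Component 2: only consulted when the hurdle is cleared.
    return max(0, round(count_given_presence(features)))


# Hypothetical stand-ins for the two fitted components.
def toy_presence_prob(features: Features) -> float:
    return 0.9 if features["disease_status_lag1"] else 0.05


def toy_count(features: Features) -> float:
    return 1.5 * features["cases_lag1"]


present = hurdle_predict(
    {"disease_status_lag1": 1, "cases_lag1": 4}, toy_presence_prob, toy_count
)
absent = hurdle_predict(
    {"disease_status_lag1": 0, "cases_lag1": 4}, toy_presence_prob, toy_count
)
```

In this toy setup the count component is never evaluated for the second record, which is the defining property of the hurdle structure.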
Disease Status
disease status confusion matrix
| .metric | desc | model | full_model |
|---|---|---|---|
| accuracy | proportion of the data that are predicted correctly | baseline | 0.85 |
| accuracy | | xgboost | 0.96 |
| kap | similar measure to accuracy(), but is normalized by the accuracy that would be expected by chance alone and is very useful when one or more classes have large frequency distributions. | baseline | 0.45 |
| kap | | xgboost | 0.88 |
| sens | the proportion of positive results out of the number of samples which were actually positive. | baseline | 0.99 |
| sens | | xgboost | 0.98 |
| spec | the proportion of negative results out of the number of samples which were actually negative | baseline | 0.36 |
| spec | | xgboost | 0.90 |
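The four metrics in the table above can all be derived from the four cells of a binary confusion matrix. The sketch below shows that arithmetic in plain Python. The counts are made up for illustration (they are not the report's data), and which class is treated as "positive" is an assumption; they were chosen to show how a classifier can reach high accuracy and sensitivity while kappa and specificity stay low on imbalanced classes.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute accuracy, Cohen's kappa, sensitivity and specificity."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sens = tp / (tp + fn)  # true positives / actual positives
    spec = tn / (tn + fp)  # true negatives / actual negatives

    # Agreement expected by chance, from the row and column totals.
    actual_pos, actual_neg = tp + fn, fp + tn
    pred_pos, pred_neg = tp + fp, fn + tn
    expected = (actual_pos * pred_pos + actual_neg * pred_neg) / (n * n)

    # Kappa rescales accuracy so that 0 means chance-level agreement.
    kap = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "kap": kap, "sens": sens, "spec": spec}


# Hypothetical counts: 800 actual positives, 200 actual negatives.
metrics = binary_metrics(tp=792, fn=8, fp=128, tn=72)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the chance-agreement term is large when one class dominates, kappa falls well below accuracy for a model that mostly predicts the majority class.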
disease status confusion matrix by taxa
| .metric | model | birds | buffaloes | camelidae | cats | cattle | cervidae | dogs | equidae | hares/rabbits | sheep/goats | swine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.85 | 0.76 | 0.770 | 0.76 | 0.86 | 0.730 | 0.80 | 0.91 | 0.85 | 0.86 | 0.87 |
| accuracy | xgboost | 0.95 | 0.96 | 0.960 | 0.97 | 0.95 | 0.970 | 0.95 | 0.97 | 0.96 | 0.96 | 0.96 |
| kap | baseline | 0.42 | 0.20 | 0.130 | 0.38 | 0.56 | 0.059 | 0.52 | 0.42 | 0.20 | 0.47 | 0.42 |
| kap | xgboost | 0.84 | 0.91 | 0.890 | 0.94 | 0.88 | 0.920 | 0.91 | 0.87 | 0.86 | 0.89 | 0.88 |
| sens | baseline | 0.98 | 1.00 | 1.000 | 1.00 | 0.99 | 1.000 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| sens | xgboost | 0.97 | 0.97 | 0.970 | 0.98 | 0.97 | 0.980 | 0.96 | 0.99 | 0.98 | 0.98 | 0.98 |
| spec | baseline | 0.34 | 0.15 | 0.094 | 0.32 | 0.49 | 0.043 | 0.48 | 0.31 | 0.14 | 0.38 | 0.32 |
| spec | xgboost | 0.85 | 0.94 | 0.920 | 0.96 | 0.91 | 0.940 | 0.94 | 0.87 | 0.87 | 0.90 | 0.90 |
disease status confusion matrix by continent
| .metric | model | Africa | Americas | Asia | Europe | NA | Oceania |
|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.84 | 0.82 | 0.85 | 0.87 | 0.94 | 0.930 |
| accuracy | xgboost | 0.95 | 0.96 | 0.96 | 0.95 | NA | 0.990 |
| kap | baseline | 0.48 | 0.38 | 0.47 | 0.46 | 0.44 | 0.120 |
| kap | xgboost | 0.88 | 0.91 | 0.89 | 0.84 | NA | 0.920 |
| sens | baseline | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.000 |
| sens | xgboost | 0.97 | 0.98 | 0.98 | 0.98 | NA | 1.000 |
| spec | baseline | 0.40 | 0.30 | 0.38 | 0.37 | 0.33 | 0.068 |
| spec | xgboost | 0.91 | 0.93 | 0.91 | 0.85 | NA | 0.920 |
disease status direction change confusion matrix
| .metric | desc | model | full_model |
|---|---|---|---|
| accuracy | proportion of the data that are predicted correctly | baseline | 0.850 |
| accuracy | | xgboost | 0.960 |
| kap | similar measure to accuracy(), but is normalized by the accuracy that would be expected by chance alone and is very useful when one or more classes have large frequency distributions. | baseline | 0.052 |
| kap | | xgboost | 0.540 |
| sens | the proportion of positive results out of the number of samples which were actually positive. | baseline | 0.470 |
| sens | | xgboost | 0.590 |
| spec | the proportion of negative results out of the number of samples which were actually negative | baseline | 0.680 |
| spec | | xgboost | 0.810 |
Note: there are baseline cases where disease status is positive but the case count is NA; these counts are imputed in the model as 0.
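The imputation described in the note above can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and pairing the zero-fill with a 0/1 missingness flag is an assumption inferred from feature names such as `cases_lag1_missing` in the variable importance output, not something the report states.

```python
from typing import List, Optional, Tuple


def impute_cases(cases: Optional[float]) -> Tuple[float, int]:
    """Return (imputed_value, missing_flag) for a single case count."""
    if cases is None:
        # Missing count: fill with 0 and record that it was missing, so a
        # model could still tell a true zero from an imputed one.
        return 0.0, 1
    return float(cases), 0


# Hypothetical reported counts; None stands in for NA.
reported: List[Optional[float]] = [12, None, 0]
imputed = [impute_cases(c) for c in reported]
```

Keeping the flag alongside the filled value is one common way to avoid conflating "no cases reported" with "zero cases".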
disease status direction change confusion matrix by taxa
| .metric | model | birds | buffaloes | camelidae | cats | cattle | cervidae | dogs | equidae | hares/rabbits | sheep/goats | swine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.850 | 0.760 | 0.77 | 0.760 | 0.860 | 0.7300 | 0.800 | 0.910 | 0.850 | 0.860 | 0.870 |
| accuracy | xgboost | 0.950 | 0.960 | 0.96 | 0.970 | 0.950 | 0.9700 | 0.950 | 0.970 | 0.960 | 0.960 | 0.960 |
| kap | baseline | 0.064 | 0.032 | 0.04 | 0.025 | 0.042 | 0.0032 | 0.039 | 0.082 | 0.043 | 0.052 | 0.061 |
| kap | xgboost | 0.390 | 0.670 | 0.66 | 0.770 | 0.510 | 0.7500 | 0.660 | 0.500 | 0.540 | 0.550 | 0.540 |
| sens | baseline | 0.440 | 0.580 | 0.55 | 0.570 | 0.430 | 0.5700 | 0.510 | 0.480 | 0.460 | 0.470 | 0.480 |
| sens | xgboost | 0.530 | 0.620 | 0.63 | 0.700 | 0.570 | 0.6900 | 0.630 | 0.560 | 0.610 | 0.580 | 0.580 |
| spec | baseline | 0.690 | 0.660 | 0.67 | 0.660 | 0.670 | 0.6400 | 0.670 | 0.700 | 0.680 | 0.680 | 0.690 |
| spec | xgboost | 0.760 | 0.860 | 0.86 | 0.900 | 0.800 | 0.9000 | 0.860 | 0.790 | 0.810 | 0.820 | 0.810 |
disease status direction change confusion matrix by continent
| .metric | model | Africa | Americas | Asia | Europe | NA | Oceania |
|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.840 | 0.820 | 0.850 | 0.87 | 0.940 | 0.930 |
| accuracy | xgboost | 0.950 | 0.960 | 0.960 | 0.95 | NA | 0.990 |
| kap | baseline | 0.036 | 0.025 | 0.059 | 0.09 | 0.065 | 0.039 |
| kap | xgboost | 0.540 | 0.590 | 0.550 | 0.49 | NA | 0.530 |
| sens | baseline | 0.450 | 0.470 | 0.470 | 0.46 | 0.430 | 0.610 |
| sens | xgboost | 0.560 | 0.580 | 0.600 | 0.58 | NA | 0.540 |
| spec | baseline | 0.670 | 0.670 | 0.680 | 0.69 | 0.680 | 0.700 |
| spec | xgboost | 0.810 | 0.830 | 0.820 | 0.79 | NA | 0.810 |
disease status variable importance and partial dependency (xgboost only)
##                            Feature        Gain       Cover   Frequency
##  1:            disease_status_lag1 0.776430287 0.054115316 0.033185841
##  2:             cases_lag1_missing 0.049485755 0.043642648 0.025073746
##  3:       ever_in_country_any_taxa 0.048458073 0.050788831 0.019911504
##  4:            disease_status_lag2 0.017900276 0.036354726 0.020648968
##  5:           log_human_population 0.013270873 0.044711771 0.135693215
##  6:             cases_lag2_missing 0.009267981 0.006659092 0.016961652
##  7:        disease_population_wild 0.009183998 0.014212483 0.007374631
##  8:             log_gdp_per_capita 0.009140356 0.023998585 0.109882006
##  9:            disease_status_lag3 0.006213909 0.034683627 0.016961652
## 10:             cases_lag3_missing 0.005740374 0.017600986 0.014749263
## 11:            log_taxa_population 0.005057817 0.020693720 0.061209440
## 12: cases_lag_sum_border_countries 0.004211446 0.047838273 0.044985251
## Preparation of a new explainer is initiated
## -> model label : disease status
## -> data : 373145 rows 317 cols
## -> data : rownames to data was added ( from 1 to 373145 )
## -> target variable : 373145 values
## -> predict function : predict
## -> predicted values : numerical, min = 1.106078e-06 , mean = 0.2240309 , max = 0.9998575
## -> model_info : package Model of class: xgb.Booster package unrecognized , ver. Unknown , task regression ( default )
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.9984326 , mean = 2.682822e-06 , max = 0.9998673
## A new explainer has been created!